Quality of White Wines by Alan Gou

Preliminary Exploration

Before I begin plotting the data, I want to first figure out a couple of things about the variables. First, how many of each quality are there?

summary(wf)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
table(wf$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Dataset Information

There are 4898 observations of 12 variables. Each observation (or row) consists of 11 variables describing various chemical and physical aspects of a wine, plus the median of the ratings given to that wine by judges, with 0 being the lowest rating and 10 the highest.

Goal of analyzing this dataset

Quality is the feature of interest - the goal of this analysis is to explore which other features of the data explain the quality of the wines.

Some preliminary expectations

From what I have read in the readme for the dataset, I am expecting levels of sulfur dioxide to play a part in determining quality - it seems like there ought to be a balance of sulfur dioxide. Too much will cause a bad, sulfurous odor, while too little may leave the wine less fresh. Beyond that, my non-existent knowledge of wine would have me expect that sugar levels, alcohol content, and salt content all have some sort of effect on quality, though in what way I really have no idea at this point. I also expect acidity to be a factor in determining quality.

Creating new ‘quality_level’ bucket

As you can see, there are no wines with ratings of 0, 1, 2, or 10. There are only 5 wines with a rating of 9 and 20 with a rating of 3. This seems like a good indication that I can group some of these ratings together into buckets: “high”, “medium-high”, “medium”, “medium-low”, and “low”. I’ll do this by adding a new variable: quality_level. This will let me use geom_freqpoly and facet_wrap more effectively, since I won’t have one category with only 5 observations in it and another category with over 2000. Low is 3 and 4, medium-low is 5, medium is 6, medium-high is 7, and high is 8 and 9.
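
A minimal sketch of how such a bucket could be built with base R's cut(). The data frame `wf_toy` and its values are made up for illustration, and the original report may have built quality_level differently (for example with ifelse).

```r
# Toy stand-in for the wine data frame (hypothetical values);
# only the quality column matters for this sketch.
wf_toy <- data.frame(quality = c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9))

# cut() uses right-closed intervals by default, so the breaks below give
# (2,4] = low, (4,5] = medium-low, (5,6] = medium, (6,7] = medium-high,
# (7,9] = high, matching the grouping described in the text.
wf_toy$quality_level <- cut(
  wf_toy$quality,
  breaks = c(2, 4, 5, 6, 7, 9),
  labels = c("low", "medium-low", "medium", "medium-high", "high"),
  ordered_result = TRUE
)

# Count how many toy wines fall into each bucket.
table(wf_toy$quality_level)
```

Making the factor ordered keeps the buckets in a sensible order when they are later used for faceting or coloring.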

Deeper Exploration - Investigating Correlations Between Features

A bit of grouping

Here, I perform some aggregations on sugar, pH, alcohol, and acidity to see if anything pops out, and then make a few graphs. Note that I create a new feature called ‘total.acidity’, which is just the sum of fixed acidity, volatile acidity, and citric acid.
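
As a rough sketch of the total.acidity feature and one possible aggregation, assuming a grouping by quality and using base R only. The data frame `wf_toy` and its values are invented for illustration, and the original analysis may well have used dplyr instead.

```r
# Toy stand-in for the wine data (hypothetical values, not from the dataset).
wf_toy <- data.frame(
  quality          = c(5, 5, 6, 6, 7, 7),
  fixed.acidity    = c(7.0, 7.4, 6.8, 6.6, 6.4, 6.2),
  volatile.acidity = c(0.30, 0.34, 0.26, 0.28, 0.22, 0.24),
  citric.acid      = c(0.30, 0.28, 0.32, 0.34, 0.36, 0.38)
)

# total.acidity is the sum of the three acidity measures, as described above.
wf_toy$total.acidity <- with(
  wf_toy,
  fixed.acidity + volatile.acidity + citric.acid
)

# Mean total acidity for each quality rating.
acidity_by_quality <- aggregate(total.acidity ~ quality, data = wf_toy, FUN = mean)
print(acidity_by_quality)
```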

Investigating Acidity

I really wish the data for quality was more continuous - on a range of 1 to 10 based on an average of all the judges’ scores for that particular wine. But alas, the data gods are not so kind. Judging from the fitted lines, it would seem that, generally, higher quality wines have less fixed and volatile acidity than lower quality wines, while higher quality wines tend to have more citric acid than lower quality wines. In terms of general acidity, higher quality wines tend to have higher pH values, which makes sense. This means higher quality wines are more basic, i.e. less acidic, and this is in line with the general trend that total acidity goes down as wine quality goes up. Still, these lines are not very strong fits - for volatile acidity, it seems that a parabolic curve would be a better fit.
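
One way to check whether a parabolic curve fits better than a straight line is to compare a linear and a quadratic lm() fit. The sketch below uses invented, deliberately U-shaped toy values (`toy` is a hypothetical data frame), so it only illustrates the mechanics and says nothing about the real data.

```r
# Hypothetical toy values with a U shape across quality ratings.
toy <- data.frame(
  quality          = 3:9,
  volatile.acidity = c(0.34, 0.29, 0.26, 0.25, 0.26, 0.30, 0.33)
)

# Straight-line fit versus a fit with an added squared term.
fit_lin  <- lm(volatile.acidity ~ quality, data = toy)
fit_quad <- lm(volatile.acidity ~ quality + I(quality^2), data = toy)

# Compare explained variance; the F-test from anova() asks whether the
# squared term improves the fit by more than chance would suggest.
r2_lin  <- summary(fit_lin)$r.squared
r2_quad <- summary(fit_quad)$r.squared
print(c(linear = r2_lin, quadratic = r2_quad))
print(anova(fit_lin, fit_quad))
```

Because the quadratic model nests the linear one, its R-squared can never be lower, so the F-test or adjusted R-squared is the more informative comparison.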

These graphs show that the correlation between acidity and pH is slightly less strong, but it still exists. Higher wine quality seems to predict lower acidity.

Investigating alcohol content

Alcohol content and wine quality are pretty closely correlated. We can use this in our model.

What seems interesting here is that while plotting sugar on its own against quality does not show much of a correlation, plotting residual sugar against alcohol and then coloring by quality seems to show that higher quality wines, which tend to have higher alcohol contents, also tend to have lower sugar levels than wines with lower alcohol contents. Plotting sugar together with alcohol content made the relationship of both features to quality easier to see.

As expected, both alcohol content and residual sugar are highly correlated with density. If we were to create a linear regression model for quality, we should avoid having all three of these variables in the model, as multicollinearity would become a significant problem.
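
A quick way to spot this kind of overlap before building a model is a pairwise correlation matrix. The sketch below uses a made-up toy data frame (`toy`, hypothetical values chosen to be monotone), so the numbers carry no evidential weight. On the real data the same cor() call would be run on the corresponding columns of wf.

```r
# Hypothetical toy values; not taken from the wine dataset.
toy <- data.frame(
  alcohol        = c(9.0, 9.5, 10.0, 11.0, 12.0, 12.5),
  residual.sugar = c(12.0, 9.0, 7.5, 4.0, 2.0, 1.5),
  density        = c(0.9985, 0.9968, 0.9955, 0.9930, 0.9908, 0.9900)
)

# Pearson correlations between every pair of columns.
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Flag predictor pairs whose absolute correlation exceeds a threshold.
# The 0.8 cutoff is an arbitrary choice for this illustration.
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]]
)
print(flagged)
```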

Investigating salt content

Immediately we see that there is a strong likelihood that salt content is correlated with quality. We will further investigate.

These plots confirm that lower salt content is correlated with higher wine quality. What is interesting is the inverse correlation between alcohol and chlorides, which I would not have expected. It seems that there are no wines with low alcohol content and low chloride levels and no wines with high alcohol content and high chloride levels. I am not sure why that is - perhaps it is a side effect of making wine with high alcohol content, or that high quality wines are produced with the goal of high alcohol content and low salt content in mind. Regardless, they are correlated, so we should bear that in mind while constructing a model so as to keep multicollinearity at a minimum.

Investigating sulfate levels

These do not tell us all that much. Let us investigate further. We will move directly into plotting these features against other possible explanatory features to see if any unexpected results show up.

It becomes clearer here that lower total sulfur dioxide seems to be correlated with higher wine quality. And, as expected, free sulfur dioxide and total sulfur dioxide are correlated with each other.

Summary of findings so far

First, let us talk about the features other than the feature of interest that are correlated with each other. Some were obvious and expected, others were not:
- alcohol and density
- sugar and density
- chlorides and alcohol
- total sulfur dioxide and free sulfur dioxide
- all the various acidities with pH

Now, let us list how these features relate to quality:
- higher alcohol and higher quality
- higher citric acid and higher quality
- lower sugar and higher quality
- lower total sulfur dioxide and higher quality
- higher pH (lower overall acidity) and higher quality
- lower salt content and higher quality

With this information, we can improve on our expectations of what makes for a high quality wine. Good wines tend to have higher alcohol contents, fruitier flavor (due to higher citric acid content), lower sugar levels, lower salt levels, lower sulfur dioxide levels, and lower overall acidity. I have left out features such as density, which is too strongly correlated with more important features such as alcohol content and chloride levels, and sulphates, which does not seem to be correlated with quality and is only very slightly correlated with total sulfur dioxide.

Quick Model

## Calls:
## m1: lm(formula = alcohol ~ quality, data = wf)
## m2: lm(formula = alcohol ~ quality + citric.acid, data = wf)
## m3: lm(formula = alcohol ~ quality + citric.acid + chlorides, data = wf)
## m4: lm(formula = alcohol ~ quality + citric.acid + chlorides + pH, 
##     data = wf)
## m5: lm(formula = alcohol ~ quality + citric.acid + chlorides + pH + 
##     residual.sugar, data = wf)
## m6: lm(formula = alcohol ~ quality + citric.acid + chlorides + pH + 
##     residual.sugar + total.sulfur.dioxide, data = wf)
## 
## =======================================================================================
##                           m1         m2         m3         m4         m5         m6    
## ---------------------------------------------------------------------------------------
## (Intercept)            6.957***   7.206***    8.284***   6.872***   9.358***   9.510***
##                       (0.106)    (0.115)     (0.120)    (0.345)    (0.316)    (0.305)  
## quality                0.605***   0.604***    0.524***   0.518***   0.480***   0.442***
##                       (0.018)    (0.018)     (0.017)    (0.017)    (0.016)    (0.015)  
## citric.acid                      -0.729***   -0.413***  -0.327**   -0.087      0.112   
##                                  (0.130)     (0.125)    (0.127)    (0.113)    (0.110)  
## chlorides                                   -15.566*** -15.400*** -14.243*** -12.450***
##                                              (0.710)    (0.710)    (0.634)    (0.619)  
## pH                                                       0.443***  -0.115      0.102   
##                                                         (0.102)    (0.092)    (0.089)  
## residual.sugar                                                     -0.096***  -0.075***
##                                                                    (0.003)    (0.003)  
## total.sulfur.dioxide                                                          -0.007***
##                                                                               (0.000)  
## ---------------------------------------------------------------------------------------
## R-squared                 0.190      0.195      0.267      0.270      0.419      0.459 
## adj. R-squared            0.190      0.195      0.266      0.269      0.418      0.459 
## sigma                     1.108      1.104      1.054      1.052      0.938      0.905 
## F                      1146.395    592.379    593.952    451.873    705.603    692.930 
## p                         0.000      0.000      0.000      0.000      0.000      0.000 
## Log-likelihood        -7450.661  -7435.065  -7205.503  -7195.982  -6636.054  -6459.235 
## Deviance               6009.118   5970.970   5436.702   5415.605   4308.757   4008.627 
## AIC                   14907.323  14878.130  14421.007  14403.964  13286.109  12934.469 
## BIC                   14926.812  14904.116  14453.490  14442.943  13331.585  12986.442 
## N                      4898       4898       4898       4898       4898       4898     
## =======================================================================================

Modeling

In the end, using just a pretty basic linear model, we get an R-squared of 0.459, which is not too shabby. Note, however, that the calls above have alcohol as the response and quality as a predictor, so this R-squared measures how well quality and the other features explain alcohol content rather than quality itself. Of course, this is far from a perfect model - a linear regression simply cannot capture all the subtleties of the data. I also included both chlorides and alcohol in the model, even though I already know that they are correlated with each other. Thus, there is some degree of multicollinearity that makes the individual coefficient estimates less trustworthy.
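
The step-by-step model building can be sketched with lm() and update(). The example mirrors the form of the first calls in the output above (alcohol as the response), but it runs on simulated toy data: the data frame `toy`, the simulated coefficients, and the seed are all assumptions made for illustration, and the printed numbers will not match the table.

```r
# Simulated toy data (hypothetical); the real analysis uses the wf data frame.
set.seed(1)
n <- 60
toy <- data.frame(
  quality   = sample(3:9, n, replace = TRUE),
  chlorides = runif(n, 0.02, 0.08)
)
toy$alcohol <- 8 + 0.4 * toy$quality - 15 * toy$chlorides + rnorm(n, sd = 0.5)

# Start with a single predictor, then add one more with update().
m1 <- lm(alcohol ~ quality, data = toy)
m2 <- update(m1, . ~ . + chlorides)

# R-squared never decreases when a predictor is added to a nested model,
# so the anova() F-test is a better guide to whether the addition helps.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))
print(anova(m1, m2))
```

To model quality instead, the response in the first formula would be swapped, for example lm(quality ~ alcohol, data = toy) on data that includes both columns.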


Final Plots and Summary

Plot One

Description One

This plot shows a clear trend towards higher alcohol content as wine quality increases. Just having a higher alcohol content seems to be a huge factor in determining wine quality - the entire boxplot moves up for each increase in quality level, which is not something I would have expected. It really makes me wonder why exactly alcohol is so strongly correlated with wine quality, and whether that bears out in real life. This plot sparked much of the exploration into whether other features were strongly correlated with alcohol - is higher alcohol content a result of a generally higher-quality wine making process, or is it purposefully sought after in the wine making process? I spent much of my time trying to explore this angle in this report.

Plot Two

Description Two

I selected these first two plots because they reveal quirks of the data that you wouldn’t have been able to see otherwise. During my EDA, it was hard to see whether sugar content related at all to wine quality - different levels of sugar content seemed to be distributed quite evenly across all wine qualities. However, this plot immediately reveals two things: (1) higher quality wines do, in fact, have lower sugar levels, and (2) there are no wines with both high alcohol and high sugar content. The insights this plot offered me meant I was now willing to use residual sugar as a feature in the linear regression model I hoped to build, since it was clearly correlated with wine quality. And this paid off - adding residual.sugar to my linear model raised the R-squared value (unadjusted) from 0.270 to 0.419.

Plot Three

Description Three

This reveals a relationship between features that I had not expected at all. For some reason, alcohol content seems to be inversely correlated with salt content - and high quality wines are overwhelmingly concentrated in the area of the plot where salt content is low and alcohol percentage is high.


Reflection

This project was intimidating at first because there were so many features. Which ones should I concentrate on? Which ones would actually have any effect? And once I plotted the distributions of each with regard to quality, I did not come out as enlightened as I had thought I would be - only alcohol and perhaps salt seemed to contribute in any way to wine quality. This was unlike the diamond data set, in which features were fewer and there were universally defined metrics for what made a better diamond.

Still, there were a few common sense hunches that I had regarding what would affect wine quality - I feel that oftentimes, our own intuition is where we begin in such investigations, and in the process of confirming or invalidating those intuitions, we discover new quirks and trends that would not have occurred to us without such exploration. That is what happened with me - I felt that sugar levels and acidity ought to have some significant effect on wine quality.

After trying various plots with sugar levels, I was about ready to give up. There seemed to be no rhyme nor reason with sugar content across different quality wines. However, when I finally plotted sugar vs. alcohol content and colored the points by quality, sugar’s inverse relationship with wine quality finally revealed itself. Needless to say, I was pleased. However, this plot also revealed sugar’s inverse correlation with alcohol, which made me wonder why exactly would there be a relationship between alcohol and sugar? Is it because of the fermentation process that converts sugar into ethanol, and therefore the higher the alcohol content, the lower the sugar level?

This led me to investigate further the relationship between alcohol and other features, and I found, to my surprise, that chlorides and alcohol were also inversely correlated. Wines with high alcohol contents also had low salt contents and were generally rated higher than wines with low alcohol contents and high salt contents, which were generally rated lower. In fact, there seemed to be a relative dearth of wines that had both high alcohol and high salt contents, as well as of wines that had both low alcohol and low salt contents. This raises the same question that the discovery of sugar’s relationship with alcohol evoked: was this a result of the wine making process that naturally meant high quality wines had high alcohol contents and low salt contents, or was this due to wine makers purposely choosing to make wines with these characteristics? I do not think this is a question that can be answered with EDA alone - it would require an understanding of the wine making process as well.

Once I got the ball rolling in mixing and matching features to see if anything strange and interesting popped out, it was a relatively straightforward process to see how acidity related to wine quality. Strangely enough, it turned out that higher citric acid was correlated with higher wine quality even though overall acidity (as measured by pH and my total acidity variable) was correlated with lower wine quality. I attributed this to higher citric acid levels making wines taste fruitier. Also, total acidity was dominated by fixed acidity - citric acid was a small enough component of total acidity that its level was nearly negligible in determining pH, so this was actually a finding that made sense. A wine that is too acidic probably tastes bad, but a fruitier wine tastes better.

Looking forward

There are still many things that could be done. There are some combinations of features that I have not plotted - namely, that between sulphates and sulfur dioxide levels with density, and whether that could change anything in my analysis. Perhaps using more boxplots would also reveal some interesting things.

Also, if I were to spend more time on this, I would likely create more robust models for predicting wine quality - using naive Bayes, or vector models, or a logistic regression. My linear model had decent results, but it is far from the best model possible.

I would like to actually compare white wines with red wines - there would probably be a lot of interesting insights into the character of these two wines, in terms of their various acidities, alcohol contents, sugar levels, etc., and what makes for a high quality red or white wine.